RETHINKING INTERMEDIATE LAYERS DESIGN IN KNOWLEDGE DISTILLATION FOR KIDNEY AND LIVER TUMOR SEGMENTATION

Knowledge distillation (KD) has demonstrated remarkable success across various domains, but its application to medical imaging tasks, such as kidney and liver tumor segmentation, has encountered challenges. Many existing KD methods are not specifically tailored for these tasks. Moreover, prevalent KD methods often lack a careful consideration of ‘what’ and ‘from where’ to distill knowledge from the teacher to the student. This oversight may lead to issues like the accumulation of training bias within shallower student layers, potentially compromising the effectiveness of KD. To address these challenges, we propose Hierarchical Layer-selective Feedback Distillation (HLFD). HLFD strategically distills knowledge from a combination of middle layers to earlier layers and transfers final layer knowledge to intermediate layers at both the feature and pixel levels. This design allows the model to learn higher-quality representations from earlier layers, resulting in a robust and compact student model. Extensive quantitative evaluations reveal that HLFD outperforms existing methods by a significant margin. For example, in the kidney segmentation task, HLFD surpasses the student model (without KD) by over 10%, significantly improving its focus on tumor-specific features. From a qualitative standpoint, the student model trained using HLFD excels at suppressing irrelevant information and can focus sharply on tumor-specific details, which opens a new pathway for more efficient and accurate diagnostic tools. Code is available here.


INTRODUCTION
Tumor segmentation in medical imaging enables clinicians to accurately identify, assess, and manage malignancies. Leveraging neural networks, we achieve automated, high-fidelity delineation of tumor boundaries in various imaging modalities, including CT scans and MRIs [1,2,3]. This technological breakthrough elevates diagnostic accuracy and efficiency and streamlines treatment planning [4], ultimately leading to enhanced patient care and outcomes.
Sparsh and Ulas are corresponding authors. The project is supported by NIH funding: R01-CA246704, R01-CA240639, U01 DK127384-02S1, and U01-CA268808. The computing system used for this research was supported by IIT Roorkee under the grant FIG-100874.
Significant challenges persist despite the remarkable successes of deep-learning models in medical image segmentation [5,6,7,8]. These models demand extensive datasets and substantial computational resources, making deployment on resource-limited devices a hurdle. Furthermore, the diversity in tumor appearances, irregular sizes, unpredictable locations, and variations amplifies segmentation complexity. To address these challenges, researchers are exploring innovative strategies. For instance, lightweight networks [9,10,11,12] have been explored for real-time semantic segmentation, and recent works have delved into real-time medical image segmentation. However, model simplification may hurt predictive performance. Knowledge distillation (KD) [13] has emerged as a valuable approach, facilitating knowledge transfer from the larger 'teacher' models to the leaner 'student' models.
Existing works [14,15,16,17,18,19,20] aim to enhance the final representations of the student model by minimizing the difference in softmax representations between the teacher and student models. However, this supervisory signal originates solely from the final student layer. Hence, it tends to attenuate with each layer during backpropagation, accumulating training bias within the shallower student layers. This impairs the efficacy of knowledge transfer. Other works [21,22,23,24] focus on improving the alignment of latent feature maps by mimicking intermediate representations. These intermediate representations serve as solid indicators that facilitate learning the final representation. However, when we replicate intermediate representations, we are limited to capturing the knowledge acquired by that specific layer, potentially missing out on global information. Recognizing this limitation, capturing features from terminal representation at an earlier stage emerges as a valuable strategy [25,26]. However, these methods give suboptimal results where boundary, shape, texture information, and a combination of low-level features are essential, not only high-level class information. We have introduced hierarchical layer-selective feedback distillation (HLFD) to address these challenges. HLFD comprises Feature-level LFD (FLFD) and Pixel-level LFD (PLFD). FLFD, in turn, includes Unified Feature-level Distillation (UFD) for unified representations and Individual Feature-level Distillation (IFD) for middle-to-early and later-to-middle layer distillation. PLFD, on the other hand, involves Unified Pixel-level Distillation (UPD) and Individual Pixel-level Distillation (IPD), which transfer pixel-level knowledge from the teacher decoder to the student through interpolated features. HLFD integrates both FLFD and PLFD in a multi-task fashion, promoting simultaneous learning of feature-level and pixel-level representations. Our contributions are as follows:
• We rethink the design of layers in the context of distillation and introduce the Hierarchical Layer-selective Feedback Distillation (HLFD) framework.
• We demonstrate HLFD's capability to capture tumor-specific details from early layers while effectively suppressing irrelevant information flow.
• Extensive experiments conducted on kidney and liver tumor segmentation tasks establish that our proposed method attains state-of-the-art (SOTA) results.

Fig. 1. The input $X$ and augmentation $X'$ undergo encoding by both a pre-trained teacher encoder and a randomly initialized student encoder, resulting in representations $z^t_{\mathrm{early}}$, $z^t_{\mathrm{mid}_j}$, $z^t_{\mathrm{ter}}$ and $z^s_{\mathrm{early}}$, $z^s_{\mathrm{mid}_j}$, $z^s_{\mathrm{ter}}$, respectively. These representations contribute to the feature-level loss functions, $L_{UFD}$ and $L_{IFD}$. Additionally, the teacher decoder decodes $z^t_{\mathrm{ter}}$, producing representations $p^t_{\mathrm{early}}$, $p^t_{\mathrm{mid}_j}$, $p^t_{\mathrm{ter}}$, which are utilized in the pixel-level loss functions, $L_{UPD}$ and $L_{IPD}$. The training process is further enhanced with the inclusion of a supervised focal dice loss ($L_{seg}$).

Feature-level Layer-selective Feedback Distillation (FLFD)
Given an input $X$, transformations occur through both the pre-trained teacher encoder $f^t_i$ and the randomly initialized student encoder $f^s_i$, where $i$ indexes the blocks. This yields early representations $z^t_{\mathrm{early}}$ and $z^s_{\mathrm{early}}$, intermediate representations $z^t_{\mathrm{mid}_j}$ and $z^s_{\mathrm{mid}_j}$ (where $j$ indexes the middle layers), and terminal representations $z^t_{\mathrm{late}}$ and $z^s_{\mathrm{late}}$. These representations form the foundation of our framework. We propose an FLFD loss, defined as $L_F = L_{UFD} + L_{IFD}$. These components are defined below.
Unified Feature-level Distillation (UFD). Within this framework, we introduce the concept of distilling the attentive knowledge from the teacher's unified representation of middle layers, $z^t_{\mathrm{mid}}$, to the student's early representation, $z^s_{\mathrm{early}}$. To achieve this, we propose the following loss function.
To obtain $z^t_{\mathrm{mid}}$, we interpolate the middle layers with larger feature maps so that their spatial dimensions match the smallest among them. Next, we concatenate all these interpolated representations along the channel dimension. Finally, the operation $A(\cdot)$ first rescales the student's representation $z^s_{\mathrm{early}}$ to match the spatial dimension of the teacher's $z^t_{\mathrm{mid}}$. Channel normalization is then applied to the rescaled student representation, under the assumption that the absolute value of a neuron activation signifies its importance.
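The construction of the unified teacher representation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`resize_nearest`, `unify_middle_layers`, `attention_map`, `ufd_loss`) are hypothetical, nearest-neighbour resizing stands in for whatever interpolation the authors use, and both the channel-collapsing attention map and the squared-error comparison are assumed forms, since the loss equation itself is not reproduced in this text.

```python
import math

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of one channel, given as a list of rows."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]

def unify_middle_layers(mid_layers):
    """Resize every middle layer to the smallest spatial size among them,
    then concatenate all channels. Each layer is a list of 2D channels."""
    out_h = min(len(layer[0]) for layer in mid_layers)
    out_w = min(len(layer[0][0]) for layer in mid_layers)
    unified = []
    for layer in mid_layers:
        unified.extend(resize_nearest(ch, out_h, out_w) for ch in layer)
    return unified

def attention_map(channels):
    """ASSUMPTION: collapse channels into one map of mean |activation|,
    then L2-normalise it, so maps with different channel counts compare."""
    h, w = len(channels[0]), len(channels[0][0])
    amap = [[sum(abs(ch[r][c]) for ch in channels) / len(channels)
             for c in range(w)] for r in range(h)]
    norm = math.sqrt(sum(v * v for row in amap for v in row)) or 1.0
    return [[v / norm for v in row] for row in amap]

def ufd_loss(student_early, teacher_mid_layers):
    """ASSUMPTION: mean squared error between the two attention maps."""
    z_t_mid = unify_middle_layers(teacher_mid_layers)
    h, w = len(z_t_mid[0]), len(z_t_mid[0][0])
    # Rescale the student's early representation to the teacher's spatial size.
    student_resized = [resize_nearest(ch, h, w) for ch in student_early]
    a_s, a_t = attention_map(student_resized), attention_map(z_t_mid)
    return sum((s - t) ** 2
               for row_s, row_t in zip(a_s, a_t)
               for s, t in zip(row_s, row_t)) / (h * w)
```

In a PyTorch setting the same steps would normally operate on tensors rather than nested lists; the list-based form is used here only to keep the sketch dependency-free.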
Individual Feature-level Distillation (IFD). Within this framework, we introduce the concept of distilling the attentive knowledge from the teacher's late representation, $z^t_{\mathrm{late}}$, to each of the student's middle layers, i.e., its intermediate representations $z^s_{\mathrm{mid}_j}$. To achieve this, we propose the following loss function.
Here, the operation $A(\cdot)$ is the same as in Eq. 1.

Pixel-level Layer-selective Feedback Distillation (PLFD)
In contrast to feature-level distillation, pixel-level segmentation-map distillation is geared toward conveying pixel-wise predictions. In practice, we distill pixel-level maps generated by the teacher's decoder to interpolated student maps. First, the teacher encoder output $z^t_{\mathrm{late}}$ is passed through the pre-trained teacher decoder $d^t_i$, resulting in an early predictive map $p^t_{\mathrm{early}}$, intermediate predictive maps $p^t_{\mathrm{mid}_j}$, and a terminal predictive map $p^t_{\mathrm{late}}$. For the student, we use interpolated representation maps. We propose $L_P = L_{UPD} + L_{IPD}$. The components of PLFD are as follows.
Unified Pixel-level Distillation (UPD). We propose distilling the precise predictive information from the teacher's unified pixel-wise predictive maps of middle layers, $p^t_{\mathrm{mid}}$, to the student's early interpolated representation, $p^s_{\mathrm{early}}$.
Individual Pixel-level Distillation (IPD). Here, the teacher's terminal predictive map, denoted as $p^t_{\mathrm{late}}$, distills precise information to each of the student's intermediate predicted maps, represented as $p^s_{\mathrm{mid}_j}$. This allows the student to capture detailed knowledge about the exact pixel locations and their corresponding class assignments within the image from much earlier layers. To achieve this, a KL-divergence loss is employed between these maps, where $N$ is the number of middle-layer blocks in the student.
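As a rough illustration of the pixel-level KL-divergence term, the sketch below compares a teacher terminal map with each of N student middle-layer maps. The function names are hypothetical, the maps are assumed to be already interpolated to a common resolution and flattened into per-pixel lists of class logits, and averaging over the N maps (rather than summing) is an assumption, since the exact equation is not reproduced in this text.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_kl(teacher_logits, student_logits, eps=1e-12):
    """Mean over pixels of KL(teacher || student).
    Each argument is a list of pixels; each pixel is a list of class logits."""
    total = 0.0
    for t_log, s_log in zip(teacher_logits, student_logits):
        p, q = softmax(t_log), softmax(s_log)
        total += sum(pi * math.log((pi + eps) / (qi + eps))
                     for pi, qi in zip(p, q))
    return total / len(teacher_logits)

def ipd_loss(teacher_late, student_mids):
    """ASSUMPTION: average the pixel-wise KL over the N student middle maps."""
    n = len(student_mids)
    return sum(pixel_kl(teacher_late, s_map) for s_map in student_mids) / n
```

For example, `ipd_loss(teacher_map, [mid_1, mid_2])` returns a value near zero when both student maps agree with the teacher and grows as their per-pixel class distributions diverge.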

Hierarchical Layer-selective Feedback Distillation (HLFD)
Finally, distilling both feature-level and pixel-level representations allows the student to learn fine-to-coarse hierarchical details at both the feature and pixel levels. The multi-task loss function combines these distillation terms with $L_{seg}$, the focal dice loss used for training the student network in a supervised fashion. In the inference phase, after sufficient training, both the teacher network components and the distillation modules are discarded.
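One way the terms might be combined is sketched below. The weights β and λ appear in the sensitivity analysis, but which term each one scales is not stated in this text, so pairing β with $L_F$ and λ with $L_P$ is an assumption, as is the hypothetical function name `hlfd_total_loss`. The defaults are the best-performing values reported in the sensitivity analysis.

```python
def hlfd_total_loss(l_seg, l_ufd, l_ifd, l_upd, l_ipd, beta=0.9, lam=0.1):
    """Combine the supervised and distillation losses into one scalar.

    ASSUMPTION: beta weights the feature-level term and lam the pixel-level
    term; the source does not state the pairing explicitly.
    """
    l_f = l_ufd + l_ifd  # L_F = L_UFD + L_IFD (feature level)
    l_p = l_upd + l_ipd  # L_P = L_UPD + L_IPD (pixel level)
    return l_seg + beta * l_f + lam * l_p
```

At inference time none of this is evaluated, since only the trained student network is kept.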

EXPERIMENTAL PLATFORM
Datasets: We evaluated our techniques on the kidney tumor segmentation (KiTS) [2] and liver tumor segmentation (LiTS) [27] datasets. KiTS comprises 210 abdominal CT scans, split 168:42 for training and testing. Similarly, the LiTS dataset consists of 201 CT scans and uses a split of 131:70.
Baselines: We compare our method with the following SOTA methods: i) Structured Knowledge Distillation (SKD) [18]: Involves pair-wise distillation to capture similarity at the feature and pixel levels. ii) Intermediate Feature Distiller (IFD) [25]: Distills the teacher's terminal representation into concatenated branches of the student model. iii) Deep Knowledge Distillation (DKD) [24]: Similar to [24] but without the Relational Knowledge Distillation (RKD) module. iv) Hierarchical Individual Feedback Knowledge Distillation (HIFD) [26]: Distills the teacher's terminal representation to individual layers of the student. We extended this method for segmentation by incorporating pixel-level feedback distillation loss functions. We maintained identical implementation settings across all techniques.
Our segmentation networks and distillation processes, inspired by [24], were trained using the Adam optimizer with beta1 = 0.9 and beta2 = 0.999. The learning rate began at 0.001 and followed a cosine annealing schedule down to a minimum of 0.000001. Data augmentation techniques such as random rotation and flipping were applied, while experiments revealed that Gaussian noise augmentation is unsuitable for medical images. Most networks processed the original 512 × 512 CT images, with HU values windowed to standard radiological ranges (e.g., -40 to 160 for the liver and -200 to 300 for the kidney). We use the PyTorch framework. We train all networks until convergence, for up to 120 epochs. We report results as mean ± std over three runs. For the Dice score (DSC), higher is better. For Relative Volume Difference (RVD), a smaller absolute value (closer to zero) is better, regardless of sign, since it indicates a closer match between the predicted and ground truth volumes. These metrics provide complementary insights about the performance.
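The preprocessing and metrics mentioned above can be illustrated with a short sketch. The function names are hypothetical, rescaling the clipped HU values to [0, 1] is an assumption (the text only states that HU values are windowed), and the DSC and RVD formulas are the commonly used definitions rather than code taken from the authors.

```python
def window_hu(hu_values, lo, hi):
    """Clip HU values to [lo, hi]; rescaling to [0, 1] is an ASSUMPTION."""
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in hu_values]

def dice_score(pred, gt):
    """DSC = 2|P ∩ G| / (|P| + |G|) for flat binary masks (0/1 ints)."""
    inter = sum(p & g for p, g in zip(pred, gt))
    denom = sum(pred) + sum(gt)
    return 1.0 if denom == 0 else 2.0 * inter / denom

def rvd(pred, gt):
    """RVD = (|P| - |G|) / |G|; the sign shows over- or under-segmentation.
    Requires a non-empty ground-truth mask."""
    return (sum(pred) - sum(gt)) / sum(gt)

# Liver window from the text: -40 to 160 HU.
liver = window_hu([-100, -40, 60, 160, 500], -40, 160)
```

A prediction that covers twice the ground-truth volume gives an RVD of 1.0, while one that covers half of it gives -0.5; both are worse than a value near zero, which is why the absolute value is compared.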

RESULTS
Quantitative Results: As shown in Table 1, our method, HLFD, consistently outperforms both the supervised student and the baseline models. Notably, on the KiTS dataset, we observe a substantial enhancement in DSC over the student (without KD). Further, both IFD and HIFD achieve outcomes competitive with or superior to those of the other baselines. These results underscore the necessity of integrating KD and also emphasize the critical importance of architecting layers that adeptly distill the 'what' and 'where' dimensions of knowledge from the teacher model. On the RVD metric as well, HLFD outperforms the baselines, including the student (without KD), by a significant margin. This insight into volume differences holds valuable implications, especially in tasks like tumor segmentation where volume accuracy is of paramount importance. On the LiTS dataset, HLFD consistently outperforms the baselines on the DSC metric, whereas the SKD method is the best on the RVD metric.
Qualitative Results: The visualizations presented in Fig. 2 showcase the superior performance of our proposed HLFD method. HLFD accurately segments the Region of Interest (ROI) while effectively suppressing irrelevant information, even at intermediate layers. The previous techniques fail to do so.
The GradCAM maps presented in Fig. 3 showcase distinct patterns among methods. SKD, which does not leverage intermediate layers, exhibits a flow of irrelevant information, hindering focus on tumor-specific details. While DKD shows some restriction of information, IFD and HIFD manage to suppress irrelevant details. However, they face challenges in focusing on tumor-specific information. In contrast, our method distinctly focuses on tumor-specific information without capturing irrelevant details.
Sensitivity Analysis: As Table 2 shows, doubling β (from 0.9 to 1.8) while holding λ constant led to a slight deterioration in performance. Increasing λ (from 0.1 to 0.2) while keeping β constant showed a similar trend, but with slightly better performance than in the previous case. This suggests that $L_F$ learns a rich representation of the data, while $L_P$ learns the essential structure for the segmentation task. Therefore, maintaining β greater than λ is crucial for optimal results. The best performance was achieved with β = 0.9 and λ = 0.1.

CONCLUSION
We introduce a novel Knowledge Distillation (KD) framework for enhancing liver and kidney tumor segmentation, redefining which knowledge is selected for distillation and which layers serve as its source when transferring from the teacher's layers to the student. Quantitatively, HLFD has demonstrated remarkable superiority over existing KD techniques and baseline models. Our method substantially improves DSC, particularly in kidney tumor segmentation, where HLFD surpasses the student model (without KD) by over 10%. Qualitatively, HLFD exhibits exceptional capabilities in suppressing irrelevant information while maintaining a sharp focus on tumor-specific details. The ability of HLFD to accurately segment the region of interest (ROI) at both intermediate and final layers showcases its effectiveness in enhancing the quality of segmentation representations for kidney and liver tumor segmentation.

COMPLIANCE WITH ETHICAL STANDARDS
This research study was conducted retrospectively using human subject data made available in open access [10,2]. Ethical approval was not required, as confirmed by the license attached to the open-access data.

CONFLICTS OF INTEREST
The authors have no relevant financial or non-financial interests to disclose.

Fig. 2. The green color highlights the regions of interest (ROI) representing tumors. The segmentation maps are presented for both KiTS (first two rows) and LiTS (last two rows) datasets, with 'G.T.' denoting the ground truth.

Fig. 3. Gradient-activated class maps for KiTS19, featuring CAM results for both the final and first layers. Remarkably, HLFD focuses on the tumor region, maintaining effectiveness in suppressing irrelevant information as the process advances to the final layer.

Table 2. Impact of β and λ on DSC